Phase transitions in optimal unsupervised learning

We determine the optimal performance of learning the orientation of
the symmetry axis of a set of $P = \alpha N$ points that are uniformly
distributed in all the directions but one of the $N$-dimensional
space. The components along the symmetry-breaking direction, of unit
vector $\mathbf{B}$, are sampled from a mixture of two gaussians of
variable separation and width. The typical optimal performance is
measured through the overlap $R_{\rm opt} = \mathbf{B} \cdot \mathbf{J}^*$,
where $\mathbf{J}^*$ is the optimal guess of the symmetry-breaking
direction. Within this general scenario, the learning curves
$R_{\rm opt}(\alpha)$ may present first order transitions if the
clusters are narrow enough. Close to these transitions, high
performance states can be obtained through the minimization of the
corresponding optimal potential, although these solutions are
metastable, and therefore not learnable, within the usual bayesian
scenario.



PACS numbers: 87.10.+e, 02.50.-r, 05.20.-y



I. INTRODUCTION 

In this paper we address a very general problem in 
the statistical analysis of large amounts of data points, 
also called examples, patterns or training set, namely the 
one of discovering the structure underlying the data set. 
Whether this determination is possible or not depends 
on the assumptions one is willing to accept. Several
algorithms that detect structure in a set of points exist.
Among them, principal component analysis finds the
directions of highest variance, projection pursuit methods
seek directions in input space onto which the projections
of the data maximize some measure of departure from
normality, whereas self-organizing clustering procedures
determine prototype vectors representative of clouds of
data. The parametric
approach assumes that the structure of the probability 
density function the patterns have been sampled from is 
known. Only its parameters have to be determined given 
the examples. A frequent guess is that the probability 
density is either gaussian, or a mixture of gaussians. The 
process of determining the corresponding parameters is 
called unsupervised learning, because we are not given 
any additional information about the data, in contrast 
with supervised learning in which each training example 
is labelled. 

It has recently been shown that finding the principal
component of a set of examples, clustering data with a
mixture of gaussians, and learning pattern classification
from examples with neural networks may be cast as
particular cases of unsupervised learning. In all these
problems, the examples are drawn from a probability density
function (pdf) with axial symmetry, and the
symmetry-breaking direction has to be determined given the
training set. As this direction may be found through the
minimization of a cost function, the properties of
unsupervised learning may be analyzed with statistical
mechanics. This approach establishes the properties of the
typical solution, determined in the thermodynamic limit,
i.e. the space dimension $N \to +\infty$ and the number of
examples $P \to +\infty$, with the fraction of examples
$\alpha = P/N$ constant.

Besides these general results, the statistical mechanics
framework allows one to deduce the expression of an optimal
cost function, whose minimum is the best solution that may
be expected to be learnt given the data. The optimal cost
function depends on the functional structure of the pdf the
examples are sampled from, and on the fraction $\alpha$ of
available examples. Its main interest is that it provides an
upper bound for the typical performance that may be expected
from any learning algorithm. On the other hand, Bayes'
formula of statistical inference gives the probability of
the symmetry-breaking direction given the training set.
Sampling the direction with Bayes' probability is called
Gibbs learning. The average of the solutions obtained
through Gibbs learning, weighted with the corresponding
probability, is called the bayesian solution. It is widely
believed that the bayesian solution is optimal. Moreover,
this has been so in all the scenarios considered so far.

In the present paper, we consider a very general
two-cluster scenario, which contains results already
reported as particular cases. In fact, two different
situations, in which the pattern distribution is a gaussian
of zero mean and unit variance in all the directions but
one, have been considered so far: a gaussian scenario and a
two-cluster scenario. In the former, the components of the
examples parallel to the symmetry-breaking direction are
sampled from a single gaussian. In the latter, these
components are drawn from a mixture of two gaussians, each
one having unit variance. The learning process has to detect
differences between the pdf along the symmetry-breaking
direction and the distributions in the orthogonal
directions. Several ad hoc cost functions for determining
the symmetry-breaking direction have been analyzed for both
scenarios. Typically, if the pdf has a non-zero mean value
in the symmetry-breaking direction, learning is "easy": the
quality of the solution increases monotonically with the
fraction $\alpha$ of examples, starting at $\alpha = 0$. In
contrast, if the pdf has zero mean, the deviations of the
pdf along the symmetry-breaking direction from the pdf in
the orthogonal directions depend on the second and higher
moments. In this case, a phenomenon called retarded learning
appears: learning the symmetry-breaking direction becomes
impossible when the fraction of examples falls below a
critical value $\alpha_c$.

Since we have considered the case of clusters of variable 
width, we could determine the entire phase diagram of 
the two-cluster scenario. Several new learning phases ap- 
pear, depending on the mean and the variance of the clus- 
ters. In particular, if the second moment of the individual 
clusters is smaller than the second moment of the pdf in 
the orthogonal directions, first order transitions from low 
to high performance learning may occur as a function of
$\alpha$. Close to these, high performance metastable states
exist above the stable states of Gibbs learning, in the
thermodynamic limit. One of the most striking results
of this paper is that these high performance metastable
states can indeed be learnt through the minimization of
an optimal $\alpha$-dependent potential, although they cannot
be obtained through bayesian learning.

Our results have been obtained within the replica
approach with the replica symmetry hypothesis. We show
below that this assumption is equivalent to the more
intuitive requirement that the optimal learning curves
$R_{\rm opt}(\alpha)$ are increasing functions of the fraction
of examples $\alpha$. To our knowledge, this fact has not been
noticed before.

The paper is organized as follows: a short presentation
of the problem and the replica calculation are given in
section II. In section III we deduce the optimal cost
functions within the replica symmetry hypothesis, as well
as the condition of replica symmetry stability. In section
IV we deduce and discuss the optimal learning curves for
the general two-cluster scenario. The typical properties
of the optimal cost functions in the complete range of
$\alpha$, presented in section V, show that bayesian learning
may not be optimal. Finally, the complete phase diagram is
described in section VI as a function of the two clusters'
parameters.



II. GENERAL FRAMEWORK AND REPLICA 
CALCULATION. 

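We consider the general case of $N$-dimensional vectors
$\boldsymbol{\xi}$, the patterns or examples of the training set,
drawn from an axially symmetric probability density
$P^*(\boldsymbol{\xi}|\mathbf{B})$ of the form:

$$P^*(\boldsymbol{\xi}|\mathbf{B}) = \frac{1}{(2\pi)^{N/2}} \exp\left[-\frac{\boldsymbol{\xi}\cdot\boldsymbol{\xi}}{2} - V^*(\lambda)\right] \tag{1}$$

where $\mathbf{B}$ is a unit vector in the symmetry-breaking
direction, i.e. $\mathbf{B}\cdot\mathbf{B} = 1$ (notice that this is
not the usual convention), and
$\lambda = \boldsymbol{\xi}\cdot\mathbf{B} = \sum_i \xi_i B_i$.
According to (1), the patterns have normal distributions, i.e.
$P(x) = \exp(-x^2/2)/\sqrt{2\pi}$, along the $N-1$ directions
orthogonal to $\mathbf{B}$. The distribution (1) in the
symmetry-breaking direction is

$$P^*(\lambda) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{\lambda^2}{2} - V^*(\lambda)\right] \tag{2}$$

Thus, $V^*(\lambda)$ introduces a modulation parallel to
$\mathbf{B}$; if $V^* = 0$ the patterns' distribution is normal in
all the directions. Normalization of $P^*$ requires:

$$\int D\lambda\, \exp[-V^*(\lambda)] = 1 \tag{3}$$

where $D\lambda = \exp(-\lambda^2/2)\,d\lambda/\sqrt{2\pi}$. The
different moments $\langle\lambda^n\rangle$ of (1) are:

$$\langle\lambda^n\rangle = \int (\boldsymbol{\xi}\cdot\mathbf{B})^n\, P^*(\boldsymbol{\xi}|\mathbf{B})\, d\boldsymbol{\xi} = \int \lambda^n P^*(\lambda)\, d\lambda. \tag{4}$$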

Several examples of functions $V^*$ have been treated
in the literature so far. In the particular case of
supervised learning of a linearly separable classification
task by a single unit neural network, the symmetry-breaking
direction $\mathbf{B}$ is the teacher's vector, orthogonal
to the hyperplane separating the classes. The class of
pattern $\boldsymbol{\xi}$ is
$\tau = {\rm sign}(\mathbf{B}\cdot\boldsymbol{\xi})$. The
corresponding pdf is
$P^*(\tau\lambda) = 2\Theta(\tau\lambda)\exp(-\lambda^2/2)/\sqrt{2\pi}$,
i.e. $V^*(\lambda) = -\ln 2$ for $\tau\lambda > 0$ and $+\infty$
for $\tau\lambda < 0$.

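In the following, we concentrate on the problem of
unsupervised learning. We are given a training set
$\mathcal{L}_\alpha = \{\boldsymbol{\xi}^\mu\}_{\mu=1,\dots,P}$ of
$P = \alpha N$ vectors sampled independently with probability
density $P^*(\boldsymbol{\xi}|\mathbf{B})$. We have to learn the
unknown symmetry-breaking direction $\mathbf{B}$ from the examples,
knowing the functional dependence of $P^*$ on $\mathbf{B}$. Using
Bayes' rule of inference, the probability of a direction
$\mathbf{J}$ (with $\mathbf{J}\cdot\mathbf{J} = 1$) given the data
is:

$$P(\mathbf{J}|\mathcal{L}_\alpha) = \frac{1}{Z} \prod_{\mu=1}^{P} \exp\left\{-\boldsymbol{\xi}^\mu\cdot\boldsymbol{\xi}^\mu/2 - V^*(\boldsymbol{\xi}^\mu\cdot\mathbf{J})\right\} P_0(\mathbf{J}), \tag{5}$$

where $P_0(\mathbf{J}) = \delta(\mathbf{J}\cdot\mathbf{J} - 1)$ is
the assumed prior probability and
$Z = \int d\mathbf{J} \prod_{\mu=1}^{P} \exp\{-\boldsymbol{\xi}^\mu\cdot\boldsymbol{\xi}^\mu/2 - V^*(\boldsymbol{\xi}^\mu\cdot\mathbf{J})\} P_0(\mathbf{J})$
is the probability of the training set. By analogy with
supervised learning, sampling the direction with probability
(5) is called Gibbs learning.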

We consider learning procedures where the direction
$\mathbf{J}$ is found through the minimization of a cost function
or energy $E(\mathbf{J};\mathcal{L}_\alpha)$. As the patterns are
independently drawn, this energy is an additive function of the
examples. The contribution of each pattern $\boldsymbol{\xi}^\mu$
to $E$ is given by a potential $V$ that depends on the direction
$\mathbf{J}$ and on $\boldsymbol{\xi}^\mu$ through the projection
(called local field) $\gamma^\mu = \mathbf{J}\cdot\boldsymbol{\xi}^\mu$:

$$E(\mathbf{J};\mathcal{L}_\alpha) = \sum_{\mu=1}^{P} V(\gamma^\mu). \tag{6}$$



As the training set only carries partial information on
the symmetry-breaking direction $\mathbf{B}$, the direction
$\mathbf{J}$ determined by the minimization of (6) will generally
differ from $\mathbf{B}$. The quality of a solution $\mathbf{J}$
may be characterized by the overlap
$R = \mathbf{B}\cdot\mathbf{J}$. If $R = 0$, $\mathbf{J}$ does not
give any information about the symmetry-breaking direction.
Conversely, if $R = 1$ the symmetry-breaking direction is
perfectly determined.
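
As an illustration of the additive energy (6) and of the overlap, the following minimal Python sketch evaluates the local fields, the energy and the overlap for a small hand-made training set. The function names (`local_fields`, `energy`, `overlap`) and the quadratic potential in the usage lines are hypothetical choices made for this example only; they are not taken from the paper.

```python
import numpy as np

def local_fields(J, patterns):
    """Projections gamma^mu = J . xi^mu, one per pattern (rows of `patterns`)."""
    return patterns @ J

def energy(J, patterns, potential):
    """Additive energy: sum over the patterns of V(gamma^mu)."""
    return float(np.sum(potential(local_fields(J, patterns))))

def overlap(B, J):
    """Overlap R = B . J between two unit vectors."""
    return float(B @ J)

# Tiny illustrative data set: P = 2 patterns in N = 2 dimensions.
patterns = np.array([[1.0, 0.0],
                     [0.0, 2.0]])
J = np.array([1.0, 0.0])            # candidate direction, J . J = 1
B = np.array([0.0, 1.0])            # symmetry-breaking direction, B . B = 1
quadratic = lambda g: 0.5 * g**2    # hypothetical example potential

E = energy(J, patterns, quadratic)  # local fields are (1, 0)
R = overlap(B, J)                   # J is orthogonal to B in this toy case
```

Any potential that acts elementwise on an array of local fields can be passed in place of `quadratic`.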

The statistical mechanics approach allows one to calculate
the expected overlap $R(\alpha)$ for any general distribution
$V^*$ and any general potential $V$, in the thermodynamic
limit $N, P \to +\infty$ with $\alpha = P/N$ finite. In this
limit, we expect that the energy is self-averaging: its
distribution is a delta peak centered at its expectation value,
independently of the particular realization of the training
patterns. Given the modulation $V^*$, different values of
$R$ may be reached, depending on the potential used for
learning. In the following, we sketch the main steps of the
derivation of the typical value of $R$ corresponding to a
general potential $V$.

The free energy $F$ corresponding to the energy (6) with
a given potential $V(\gamma)$ is

$$F(\beta,N,\mathcal{L}_\alpha) = -\frac{1}{\beta}\ln Z(\beta,N,\mathcal{L}_\alpha), \tag{7}$$

where $\beta$ is the inverse temperature and $Z$ the partition
function:

$$Z(\beta,N,\mathcal{L}_\alpha) = \int d\mathbf{J}\, \exp\{-\beta E(\mathbf{J};\mathcal{L}_\alpha)\}\, \delta(\mathbf{J}\cdot\mathbf{J} - 1). \tag{8}$$

As mentioned before, in the thermodynamic limit the free
energy is self-averaging, i.e.:

$$\lim_{N\to+\infty} \frac{1}{N} F(\beta,N,\mathcal{L}_\alpha) = \lim_{N\to+\infty} \frac{1}{N} \overline{F(\beta,N,\mathcal{L}_\alpha)} \tag{9}$$

where $\overline{(\dots)}$ stands for the average over all the
possible training sets. The average in the right hand side of
eq. (9) is calculated using the replica method:

$$\overline{\ln Z} = \lim_{n\to 0} \frac{1}{n} \ln \overline{Z^n} \tag{10}$$

which reduces the problem of averaging $\ln Z$ to the one
of averaging the partition function of $n$ replicas of the
original system, and taking the limit $n \to 0$. The properties
of the minimum of the cost function are those of the
zero temperature limit ($\beta \to +\infty$) of the free energy.
In the case of differentiable potentials $V$, the integrals are
dominated by the saddle point, and the zero temperature
free energy reads:



$$f(R,c) = \lim_{\beta\to+\infty} \lim_{N\to+\infty} \frac{1}{N} \overline{F(\beta,N,\mathcal{L}_\alpha)} = -\frac{1}{2c}\left[1 - R^2 - 2\alpha \int Dt\, W(t;c) \int Dz\, \exp[-V^*(\lambda)]\right] \tag{11}$$

where

$$\lambda = z\sqrt{1 - R^2} + R\,t. \tag{12}$$

In (11), $R$ is the overlap between the symmetry-breaking
direction $\mathbf{B}$ and a minimum $\mathbf{J}$ of the cost
function (6); $c = \lim_{\beta\to+\infty} \beta(1 - q)$, where $q$
is the overlap between minima of the cost function (6) for two
different replicas, and

$$W(t;c) = \min_{\gamma}\left[c\,V(\gamma) + (\gamma - t)^2/2\right] \tag{13}$$

is the saddle point equation. The extremum conditions
of the free energy (11) with respect to $R$ and $c$,
$\partial f/\partial R = \partial f/\partial c = 0$, give the
following equations for $R$ and $c$:

$$1 - R^2 = \alpha \int_{-\infty}^{+\infty} Dt\, [\gamma(t;c) - t]^2 \int_{-\infty}^{+\infty} Dz\, \exp[-V^*(\lambda)], \tag{14a}$$

$$R = \frac{\alpha}{\sqrt{1 - R^2}} \int_{-\infty}^{+\infty} Dt\, [\gamma(t;c) - t] \int_{-\infty}^{+\infty} Dz\, z\, \exp[-V^*(\lambda)], \tag{14b}$$

where $\lambda$ is defined in (12) and $\gamma(t;c)$ is the value
that minimizes (13). Introduction of (14) into (11) gives the
free energy at zero temperature:

$$f(R,c) = \alpha \int Dt\, V(\gamma(t;c)) \int Dz\, \exp[-V^*(\lambda)]. \tag{15}$$

If the potential $V(\gamma)$ is not convex, eqs. (14) may have
more than one solution. In that case, the one minimizing
(15) with respect to $R$ should be kept.
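
The step from (11) to (15) can be made explicit. The lines below are a sketch reconstructed from the definitions (11) and (13), using only the stationarity condition with respect to $c$. Here $\gamma$ stands for $\gamma(t;c)$, and the double-bracket shorthand is introduced for this note only.

```latex
% Shorthand (introduced here): <<g>> = \int Dt\, g(t) \int Dz\, \exp[-V^*(\lambda)]
\begin{align*}
W(t;c) &= c\,V(\gamma) + \tfrac{1}{2}(\gamma - t)^2 ,
\qquad \frac{\partial W}{\partial c} = V(\gamma)
\quad \text{(the minimum over } \gamma \text{ is stationary)} \\
\frac{\partial f}{\partial c} = 0
&\;\Rightarrow\; 1 - R^2 = \alpha \,\langle\langle (\gamma - t)^2 \rangle\rangle \\
f(R,c) &= -\frac{1}{2c}\Big[ 1 - R^2 - 2\alpha c\,\langle\langle V(\gamma)\rangle\rangle
          - \alpha\,\langle\langle (\gamma - t)^2\rangle\rangle \Big]
        = \alpha\,\langle\langle V(\gamma) \rangle\rangle
\end{align*}
```

The second line is the condition on $c$, and the last equality is the zero temperature free energy (15).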

These results were obtained under the assumption of
replica symmetry. A necessary condition for the replica
symmetry hypothesis to be satisfied is:

$$\alpha \int_{-\infty}^{+\infty} Dt\, [\gamma'(t;c) - 1]^2 \int_{-\infty}^{+\infty} Dz\, \exp[-V^*(\lambda)] < 1, \tag{16}$$

with $\gamma'(t;c) = \partial\gamma/\partial t$.



III. OPTIMAL POTENTIAL AND REPLICA 
SYMMETRY STABILITY CONDITION. 

Given any modulation $V^*$, the typical overlap $R$ obtained
through the minimization of a differentiable potential $V$
may be determined as a function of $\alpha$ by solving
equations (14). The result is consistent if condition (16)
is verified. In this section, we are interested in the best
performance that may be expected. Recently, a general
expression has been deduced for the optimal potential, whose
minimization gives the solution with maximum overlap
$R_{\rm opt}$. This optimal potential $V_{\rm opt}$ depends
implicitly on $\alpha$ through $R_{\rm opt}(\alpha)$, and on the
probability distribution $P^*$ through the modulation $V^*$. It
was obtained under the assumption of replica symmetry, which has
been shown to be correct for the particular cases investigated
so far. In fact, the stability condition of replica symmetry
for optimal learning is verified whenever the slope of the
learning curves is positive, as will be shown below. For
the sake of completeness, we first describe an alternative
derivation of the optimal potential. Following the same
lines we used for supervised learning, $V_{\rm opt}$ is
determined through a functional maximization of $R$, given by
eq. (14), with respect to $V$ at constant $\alpha$. As has been
discussed elsewhere, the parameter $c$ sets the energy units and
may be arbitrarily chosen. We used $c = 1$ throughout, without
any loss of generality. After a straightforward calculation
we obtain that the optimal overlap $R_{\rm opt}$ is given by
the inversion of:

$$\alpha(R_{\rm opt}) = R_{\rm opt}^2 \left[\int_{-\infty}^{+\infty} Dt\, \frac{\left[\int Dz\, z\, \exp(-V^*(\lambda))\right]^2}{\int Dz\, \exp(-V^*(\lambda))}\right]^{-1} \tag{17}$$

where $\lambda$, given by (12), reads
$\lambda = z\sqrt{1 - R_{\rm opt}^2} + R_{\rm opt}\, t$.

Notice that eq. (17) may not be invertible, i.e.,
$R_{\rm opt}(\alpha)$ may be multivalued. In this case, the
correct solution has to be selected.

$V_{\rm opt}$ is determined through the integration of:

$$V'_{\rm opt}(\gamma_{\rm opt}(t)) = -\frac{1 - R_{\rm opt}^2}{R_{\rm opt}^2}\, \frac{d}{dt} \ln \int Dz\, \exp(-V^*(\lambda)) \tag{18}$$

where the argument of $V'_{\rm opt}$ is given by the saddle-point
equation (13) with $c = 1$, i.e.:

$$\gamma_{\rm opt}(t) = t - V'_{\rm opt}(\gamma_{\rm opt}(t)). \tag{19}$$

Since $R$ is parametrized by $\alpha$, the cost function leading
to optimal performance is different for different training
set sizes.
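
One way to see where (17)-(19) come from is a Cauchy-Schwarz bound on the saddle-point equations (14). This is offered as a sketch only: it is not necessarily the route followed by the functional maximization mentioned above, and the shorthands $G(t)$ and $H(t)$ are introduced here.

```latex
% Shorthands (introduced here), with \lambda = z\sqrt{1-R^2} + R t:
%   G(t) = \int Dz\, \exp[-V^*(\lambda)],   H(t) = \int Dz\, z\, \exp[-V^*(\lambda)]
\begin{align*}
&\text{Eqs. (14):}\quad
1 - R^2 = \alpha \int Dt\, (\gamma - t)^2\, G , \qquad
R\sqrt{1 - R^2} = \alpha \int Dt\, (\gamma - t)\, H \\
&\text{Cauchy-Schwarz:}\quad
\Big[\int Dt\, (\gamma - t)\, H\Big]^2
\le \int Dt\, (\gamma - t)^2\, G \;\int Dt\, \frac{H^2}{G}
\;\Rightarrow\; R^2 \le \alpha \int Dt\, \frac{H^2}{G} \\
&\text{Equality (eq. (17)) requires}\quad
\gamma - t = \frac{\sqrt{1-R^2}}{R}\,\frac{H}{G}
= \frac{1-R^2}{R^2}\,\frac{d}{dt}\ln G ,
\quad\text{since}\quad \frac{dG}{dt} = \frac{R}{\sqrt{1-R^2}}\,H \\
&\text{With } c = 1 \text{ and } V'(\gamma) = t - \gamma \text{ this is eq. (18).}
\end{align*}
```

The proportionality constant $\sqrt{1-R^2}/R$ follows from inserting $\gamma - t \propto H/G$ back into (14a) and using the saturated bound.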

Eqs. (17) and (18) were previously derived by Van den Broeck
and Reimann, who showed that the typical overlap $R_b$ of
bayesian learning satisfies the same equation (17) as
$R_{\rm opt}$. However, this only guarantees that bayesian
learning is optimal if eq. (17) is invertible. In that case its
unique solution is $R_b = R_{\rm opt}$. Otherwise, as is
discussed in the example of section V, solutions with
$R_{\rm opt} > R_b$ may exist.

The results derived so far are valid under the replica
symmetry hypothesis, and must thus satisfy (16). Taking
(17) and (18) into account, a cumbersome but
straightforward calculation gives:

$$1 - \alpha \int_{-\infty}^{+\infty} Dt\, [\gamma'(t;c) - 1]^2 \int_{-\infty}^{+\infty} Dz\, \exp[-V^*(\lambda)] = \frac{R_{\rm opt}(1 - R_{\rm opt}^2)}{2\alpha}\, \frac{d\alpha(R_{\rm opt})}{dR_{\rm opt}}. \tag{20}$$

Therefore, in the case of optimal learning, the necessary
condition of replica symmetry stability (16) is equivalent
to the natural requirement that the learning curve
$R_{\rm opt}(\alpha)$ is an increasing function of the fraction
of examples $\alpha$ for $R_{\rm opt} \neq 0, 1$. This relation,
which does not seem to have been noticed before, is independent
of the distribution (1) the data set is sampled from.

In the cases where the analytic function $\alpha(R_{\rm opt})$
given by (17) is not invertible, only the branches with
positive slope have to be considered, as they trivially satisfy
the replica symmetry condition. Examples of such behaviour
are shown in the next section.

Hence, given any sufficiently differentiable modulating
function $V^*$, as long as $R_{\rm opt} \neq 0, 1$ there exists
an optimal potential $V_{\rm opt}(\gamma)$, consistent with the
assumptions of the replica calculation, which depends implicitly
on $\alpha$ through $R_{\rm opt}(\alpha)$, and on $V^*$. The
minimum $\mathbf{J}^*$ of the corresponding energy (6) maximizes
the overlap $R$ between $\mathbf{J}^*$ and the symmetry-breaking
direction $\mathbf{B}$.

The expansion of $\alpha(R_{\rm opt})$ for small $R_{\rm opt}$
shows that $R_{\rm opt} > 0$ for all $\alpha > 0$ if and only if
$\langle\lambda\rangle \neq 0$. In that case, for $\alpha \ll 1$,
$R_{\rm opt} \simeq \langle\lambda\rangle\sqrt{\alpha}$, like with
Hebb's learning rule. If $\langle\lambda\rangle = 0$, two different
behaviours may arise: either a continuous transition from
$R_{\rm opt} = 0$ to $R_{\rm opt} \sim \sqrt{\alpha - \alpha_c}$
occurs at $\alpha_c = (1 - \langle\lambda^2\rangle)^{-2}$, or the
overlap jumps from $R_{\rm opt} = 0$ to $R_{\rm opt} > 0$ through
a first order transition at $\alpha_1 < \alpha_c$. In particular,
if $\langle\lambda^2\rangle = 1$, only a discontinuous transition
may occur since $\alpha_c = +\infty$. Discontinuities between two
finite values of $R_{\rm opt}$ may also arise for
$\alpha > \alpha_c$. All these phase transitions appear in
the two-cluster scenario that we analyze in the next section.
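
The threshold of the continuous transition follows from expanding (17) to lowest order in $R_{\rm opt}$. A sketch of this expansion is given below, with the shorthands $G$ and $H$ introduced here for the two inner integrals of (17):

```latex
% \lambda = z\sqrt{1-R^2} + R t; expand to first order in R, using
% \int Dz\, z\, g(z) = \int Dz\, g'(z) and the normalization (3).
\begin{align*}
H(t) &\equiv \int Dz\, z\, e^{-V^*(\lambda)}
      \simeq \langle\lambda\rangle + R\,t\,\big(\langle\lambda^2\rangle - 1\big),
\qquad
G(t) \equiv \int Dz\, e^{-V^*(\lambda)} \simeq 1 \\
\langle\lambda\rangle \neq 0 :&\quad
R_{\rm opt}^2 \simeq \alpha\,\langle\lambda\rangle^2 \\
\langle\lambda\rangle = 0 :&\quad
R_{\rm opt}^2 \simeq \alpha\,R_{\rm opt}^2\,\big(1 - \langle\lambda^2\rangle\big)^2
\;\Rightarrow\;
\alpha_c = \big(1 - \langle\lambda^2\rangle\big)^{-2}
\end{align*}
```

For $\langle\lambda\rangle = 0$ a non-zero solution of the linearized equation can only appear at $\alpha_c$, which diverges when $\langle\lambda^2\rangle = 1$.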



IV. A CASE STUDY: TWO-CLUSTER 
DISTRIBUTIONS. 

Consider the general two gaussian-clusters scenario, in
which the modulation along the symmetry-breaking
direction (2) is:

$$P^*(\lambda) = \frac{1}{2\sigma\sqrt{2\pi}} \sum_{\epsilon=\pm 1} \exp\left[-\frac{(\lambda + \epsilon\rho)^2}{2\sigma^2}\right]. \tag{21}$$

This distribution is a generalization of the one studied by
Watkin and Nadal, who considered optimal learning
for clusters with $\sigma = 1$. If $\rho = 0$, (21) corresponds
to the single gaussian scenario studied by Reimann et al. In
this paper we investigate the complete phase diagram in
the plane $\rho$, $\sigma$.

The first two moments of (21) are

$$\langle\lambda\rangle = 0, \tag{22}$$

$$\langle\lambda^2\rangle = \rho^2 + \sigma^2. \tag{23}$$

Thus, if $\sigma = 1$ only distributions with
$\langle\lambda^2\rangle > 1$ are considered. The optimal solution
in that case is close to the one obtained with a quadratic
potential. Quadratic potentials detect the direction extremizing
the variance of the training set, which we call variance
learning. We show below that the optimal overlap may be much
larger than the one obtained through variance learning if the
clusters have $\sigma < 1$.
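
To make the two-cluster scenario and variance learning concrete, here is a minimal numerical sketch. It assumes that minimizing a quadratic potential over unit vectors amounts to taking the extreme eigenvector of the empirical second-moment matrix (the largest one when $\langle\lambda^2\rangle > 1$). The function names `sample_two_cluster` and `variance_learning`, as well as the values of $N$, $\alpha$, $\rho$ and $\sigma$, are illustrative choices and not taken from the paper.

```python
import numpy as np

def sample_two_cluster(P, N, rho, sigma, rng):
    """Draw P patterns in N dimensions; returns (patterns, B)."""
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                       # unit vector, B . B = 1
    x = rng.standard_normal((P, N))              # isotropic unit gaussian
    eps = rng.choice([-1.0, 1.0], size=P)        # cluster label of each pattern
    lam = eps * rho + sigma * rng.standard_normal(P)
    # Replace the component of x along B by the two-cluster variable lam.
    patterns = x + np.outer(lam - x @ B, B)
    return patterns, B

def variance_learning(patterns):
    """Unit vector maximizing the empirical second moment of the local fields."""
    C = patterns.T @ patterns / patterns.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
N, alpha, rho, sigma = 200, 40.0, 1.2, 0.5       # illustrative values only
patterns, B = sample_two_cluster(int(alpha * N), N, rho, sigma, rng)
J = variance_learning(patterns)
R = abs(B @ J)                                   # the sign of J is arbitrary
```

For distributions with $\langle\lambda^2\rangle < 1$ the same sketch would take the eigenvector of smallest eigenvalue, `eigvecs[:, 0]`, instead.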

Introducing the expression of $V^*$ obtained from (2)
and (21) into (17) gives $\alpha$ as a function of
$R_{\rm opt}$. It turns out that, for some values of $\alpha$,
this function has three different roots for
$R_{\rm opt}(\alpha)$, as is apparent in figures 1 and 2.
The one lying on the branch with negative slope violates
the assumption of replica symmetry. The two others
correspond to minima of the corresponding free energies.
Figures 1, 2 and 3 show the optimal learning curves for
several values of $\rho$ and $\sigma$ in the range not
investigated before. The two branches $R_{\rm opt}(\alpha)$ with
positive slope that satisfy condition (16), and the dotted line
of negative slope (inconsistent with the assumption of replica
symmetry), are presented for illustration. The value of $\alpha$
at which the jump from one branch to the other occurs
is discussed in the next section. The performance obtained
through learning with simple quadratic potentials is also
presented, to show the dramatic improvement of optimal
learning with respect to variance learning for double
clusters with $\sigma < 1$.
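
The function $\alpha(R_{\rm opt})$ can be evaluated numerically. The sketch below assumes that eq. (17) reads $\alpha = R_{\rm opt}^2 / \int Dt\, H^2/G$, with $G = \int Dz\, e^{-V^*(\lambda)}$ and $H = \int Dz\, z\, e^{-V^*(\lambda)}$, and that $e^{-V^*}$ is the two-cluster density (21) divided by a unit gaussian. It uses Gauss-Hermite quadrature for the $z$ integrals and a trapezoidal rule for the $t$ integral. The function names and numerical parameters are hypothetical, and the sketch is meant for narrow clusters, $\sigma < 1$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def exp_minus_vstar(lam, rho, sigma):
    """exp[-V*(lam)]: two-cluster density divided by the unit gaussian."""
    total = 0.0
    for eps in (+1.0, -1.0):
        total = total + np.exp(0.5 * lam**2
                               - (lam + eps * rho)**2 / (2.0 * sigma**2))
    return total / (2.0 * sigma)

def alpha_of_R(R, rho, sigma, n_z=100, t_max=8.0, n_t=1601):
    """alpha(R) = R^2 / int Dt H(t)^2 / G(t), for 0 < R < 1."""
    z, w = hermegauss(n_z)                 # nodes/weights for weight exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)           # normalizes the sum to an integral over Dz
    t = np.linspace(-t_max, t_max, n_t)
    lam = np.sqrt(1.0 - R**2) * z[None, :] + R * t[:, None]
    f = exp_minus_vstar(lam, rho, sigma)
    G = f @ w                              # int Dz exp[-V*(lam)]
    H = (f * z) @ w                        # int Dz z exp[-V*(lam)]
    phi = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    integrand = phi * H**2 / G
    dt = t[1] - t[0]                       # trapezoidal rule over t
    integral = dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return R**2 / integral

rho, sigma = 1.2, 0.5                      # parameters quoted for FIG. 5
alpha_small_R = alpha_of_R(0.02, rho, sigma)
alpha_c = (1.0 - (rho**2 + sigma**2)) ** (-2)
```

For small $R_{\rm opt}$ the result should approach $\alpha_c = (1 - \langle\lambda^2\rangle)^{-2}$, which for $\rho = 1.2$, $\sigma = 0.5$ is about 2.10, the critical value quoted in the caption of FIG. 2. Scanning `alpha_of_R` over a grid of $R$ values gives the curve whose positive-slope branches are discussed above.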



V. BAYESIAN VERSUS OPTIMAL SOLUTIONS. 



As pointed out in section III, eq. (17) may be deduced
in two different ways: through the determination of the
bayesian learning performance, or through functional
optimization. The latter procedure yields a cost function for
each fraction of examples $\alpha$, whose minimum gives the
solution with maximal overlap.

The bayesian solution to the learning problem is given
by the average of solutions sampled with Gibbs'
probability. A simple argument shows that the typical
bayesian performance satisfies $R_b = \sqrt{R_G}$, where $R_G$ is
the typical overlap between a solution drawn with
probability (5) and the symmetry-breaking direction
$\mathbf{B}$. $R_G$ minimizes the free energy with potential
$V(\gamma) = V^*(\gamma)$ at inverse temperature $\beta = 1$.
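
The argument is not spelled out in the text. A standard version is sketched here under one assumption, usual in this bayesian setting where the potential matches the true modulation: the typical overlap $q$ between two independent Gibbs samples equals $R_G$.

```latex
% J_1, ..., J_n: independent samples drawn with probability (5);
% assumption: J_a \cdot J_b \to q = R_G for a \neq b, and B \cdot J_a \to R_G.
\begin{align*}
\bar{\mathbf{J}} &= \frac{1}{n}\sum_{a=1}^{n} \mathbf{J}_a , \qquad
\bar{\mathbf{J}}\cdot\bar{\mathbf{J}}
  = \frac{1}{n} + \frac{n-1}{n}\,q \;\xrightarrow[n\to\infty]{}\; R_G , \qquad
\mathbf{B}\cdot\bar{\mathbf{J}} \to R_G \\
R_b &= \frac{\mathbf{B}\cdot\bar{\mathbf{J}}}{\sqrt{\bar{\mathbf{J}}\cdot\bar{\mathbf{J}}}}
     = \frac{R_G}{\sqrt{R_G}} = \sqrt{R_G}
\end{align*}
```

Averaging the samples thus removes the components that fluctuate from one sample to the next, and the normalized average has a larger overlap than any single sample.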

As eq. (17) is satisfied both by $R_b$ and $R_{\rm opt}$, it is
tempting to conclude that bayesian learning is optimal. If
eq. (17) has a unique solution, this is obviously the case.
However, equation (17) may not be invertible. This arises
in the two-cluster scenario presented in the previous
section, where two branches of solutions consistent with the
assumption of replica symmetry exist for some values of
$\alpha$. In the case of bayesian learning, these branches
result from the fact that Gibbs' free energy has two local
minima as a function of $R$. $R_G$, the thermodynamically
stable state, corresponds to the absolute minimum.
When $\alpha$ changes, $R_G$ jumps from one branch to the other
through a first order phase transition at $\alpha = \alpha_G$,
where both minima have the same free energy. Therefore
the bayesian solution, which is the average of the
solutions sampled with Gibbs' probability, presents a jump
at the same value $\alpha_G$ as Gibbs' performance. Thus, the
metastable states of higher performance than $R_b$, which
exist for $\alpha < \alpha_G$, cannot be obtained through
bayesian learning.

On the other hand, in section III we determined optimal
potentials whose minimization yields the performance
$R_{\rm opt}$. These potentials exist for all the
pairs $(\alpha, R_{\rm opt}(\alpha))$ lying on the monotonically
increasing branches of $R_{\rm opt}(\alpha)$, which satisfy the
hypothesis of replica symmetry. Potentials that reach the
performances of the upper (Gibbs-metastable) branch thus
exist. It should be noticed that we cannot determine the
position of the jump of $R_{\rm opt}$ through the comparison of
the free energies corresponding to solutions on different
branches at the same $\alpha$, as was done to determine
$\alpha_G$, because a different potential has to be minimized
for each pair $(\alpha, R_{\rm opt}(\alpha))$ and, as discussed
in section III, these potentials are measured in the arbitrary
units determined by our choice $c = 1$.

In order to clarify this problem, we studied the
performance of the minima of the optimal potentials. In fact,
the properties of each of the potentials $V_{\rm opt}(\lambda)$
may be determined for any value of $\alpha$ (besides the value
for which it has been optimized) in the same way as those of
other ad hoc potentials, by solving numerically eqs. (14). Figs.
4 and 5 present several learning curves $R(\alpha)$ obtained
with potentials $V_{\rm opt}$ optimized for overlaps lying on the
upper metastable branch of Gibbs' learning. They
correspond to the same clusters' parameters as figs. 1 and
2. Each learning curve is tangent to the optimal learning
curve at the point $(\alpha(R_{\rm opt}), R_{\rm opt})$ at which
the potential was determined. This result holds in particular
for all the points lying on the high-performance metastable
branch of bayesian learning, i.e. for
$\alpha_1 < \alpha < \alpha_G$.
It is important to point out that the free energy (11)
presents a unique replica symmetric minimum as a function
of $R$ for all these potentials. Thus, these results
show that the corresponding optimal potentials $V_{\rm opt}$
select, among the metastable states of Gibbs learning,
the one of largest overlap. In particular, the Gibbs'
metastable states in the upper branch for $\alpha < \alpha_G$ are
learnable through the minimization of the corresponding
optimal potential. Thus, in the range
$\alpha_1 < \alpha < \alpha_G$ bayesian learning is not optimal.
This surprising behavior may arise whenever the curve
$R_G(\alpha)$ of Gibbs learning presents first order phase
transitions.

It is worth noting that, besides the solutions that
verify the replica symmetric condition (16), solutions
unstable under replica symmetry breaking with smaller $R$ and
slightly higher free energy also exist. The nature of these
states is very different from that of the metastable states
of Gibbs learning. Whether the typical performance in
the case of the double cluster distributions is the one
described by the replica symmetric solution or not remains
an open problem.

VI. THE PHASE DIAGRAM 

In this section we describe, on the $\rho$-$\sigma$ plane, all the
possible learning phases that may arise in unsupervised
learning within the two gaussian-clusters scenario. As
shown in fig. 6, depending on the values of $\rho$ and
$\sigma$, qualitatively different behaviours of the learning
curves $R_{\rm opt}(\alpha)$ may appear. They are correlated with
the form of the corresponding optimal potentials.

The regions marked with an "S" are regions of
variance-type learning: the optimal potential is a single
well with $V_{\rm opt} \to +\infty$ for $\lambda \to \pm\infty$ if
$\sigma^2 < 1$, and $V_{\rm opt} \to -\infty$ for
$\lambda \to \pm\infty$ if $\sigma^2 > 1$. In these regions, the
learning curves increase monotonically with $\alpha$, starting
at $\alpha_c = (1 - \langle\lambda^2\rangle)^{-2}$, like for
quadratic potentials.

For parameter values outside the "S" regions,
$V_{\rm opt} \to +\infty$ for $\lambda \to \pm\infty$, even in the
large variance region $\langle\lambda^2\rangle > 1$ where naively
one would expect the potential to have the same asymptotic
behaviour as for $\sigma^2 > 1$. Depending on the value of
$R_{\rm opt}$, the optimal potential may be a double-well
function of the local field $\gamma$. In the latter case, the
optimal learning strategy looks for structure in the data
distribution rather than for directions extremizing the
variance. This is more striking on the line
$\langle\lambda^2\rangle = 1$, corresponding to distributions
with the same second moment in all the directions. On this line,
variance learning is impossible and $\alpha_c = \infty$. However,
in the entire light-grey region including this line,
high-performance learning is achieved if the adequate potential
is minimized. The optimal overlap presents jumps from
$R_{\rm opt} = 0$ to finite $R$ at a fraction of examples
$\alpha < \alpha_c$. In the high-performance branch, the optimal
potential is double-well, with the two minima close to
$\pm\rho$, as shown in figure 7. Thus, the potential is
sensitive to the two cluster structure, and its minimization
results in high performance learning. For $\rho$ and $\sigma$ in
the dark-grey regions, a first order transition to large $R$ also
takes place, but for $\alpha > \alpha_c$. Below the transition,
optimal learning is mainly controlled by the variance of the
training set.

In the white regions on both sides of the dark-grey
ones, no first order phase transitions to high performance
learning occur as a function of $\alpha$. In the white region
just below the dark-grey one, the potential changes smoothly
from a single to a double well with increasing $R_{\rm opt}$. The
two minima appear at $\gamma = 0$, and move apart with
increasing $R_{\rm opt}$, as shown in fig. 8. However, as long as
these minima are not sufficiently far apart, $R_{\rm opt}$
remains close to the values obtained with simple quadratic
potentials. Conversely, in the upper white region, which
corresponds to $\langle\lambda^2\rangle \gg 1$, the minima of the
optimal potential are far apart, in a region of large local
fields, where the patterns' distribution is vanishingly small.
Thus, in the range of pertinent values of $\gamma$ the potential
is concave ($V''_{\rm opt} < 0$), and here also, like in the
lower white region, the values of $R_{\rm opt}$ are close to
those obtained with quadratic potentials.

VII. CONCLUSION 

Learning the symmetry-breaking direction of a distribution
of patterns with axial symmetry in high dimensions
is a difficult problem. In this paper we determined
the optimal performance that may be reached if the patterns'
distribution has a double-cluster structure in the
symmetry-breaking direction. Depending on the clusters'
size and separation, the learning curves may present
several phases with increasing $\alpha$, including novel first
order transitions from low-performance variance learning to
high-performance structure detection. We showed that
when the optimal learning curves present such discontinuities,
bayesian learning may not be optimal. These results rely
on the assumption that the solution with replica
symmetry is the absolute minimum of the free energies
studied. Although we showed that our solutions satisfy
the replica symmetry stability condition, we cannot rule
out the existence of states of lower energy, but having
broken replica symmetry.

FIG. 2. Learning curves for the two-clusters scenario, for
cluster parameters corresponding to the central small square
of fig. 6. Full line: optimal learning. Dash-dotted lines:
lower branch of metastable solutions to optimal learning.
Also shown: the replica-symmetry unstable curve (dotted
line). The lowest dashed line corresponds to learning with
a quadratic potential (variance-learning). Here,
$\alpha_1 = 2.49$, $R_{\rm opt}(\alpha_1) = 0.76$; the bayesian
first order transition occurs at $\alpha_G = 2.52$,
$R_{\rm opt}(\alpha_G) = 0.81$; the critical $\alpha$ for
variance-learning is $\alpha_c = 2.10$.

FIG. 3. Optimal learning curves (full line) for the
two-clusters scenario, for cluster parameters corresponding
to the upper small square of fig. 6. The lowest dashed
line corresponds to learning with a quadratic potential
(variance-learning). Here, $\alpha_c = 0.68$.

FIG. 4. Learning curves for $\rho = 1.1$, $\sigma = 0.5$ obtained
with the optimal potentials corresponding to
$R_{\rm opt} = 0.84$, $R_{\rm opt} = 0.87$ and
$R_{\rm opt} = 0.90$ (full lines). Only the solutions
consistent with the replica symmetry hypothesis are shown.
Dotted lines: optimal solution.

FIG. 5. Learning curves for $\rho = 1.2$, $\sigma = 0.5$ obtained
with the optimal potentials corresponding to
$R_{\rm opt} = 0.76$, $R_{\rm opt} = 0.79$ and
$R_{\rm opt} = 0.81$ (full lines). Only the solutions
consistent with the replica symmetry hypothesis are shown.
Dotted lines: optimal solution.

FIG. 6. Phase diagram of the two-cluster scenario. The
three small squares correspond to the learning curves of figs.
1, 2 and 3.

FIG. 7. Potentials for optimal learning in the grey regions
of the phase diagram, showing the evolution of the separation
between minima with $\alpha$.

FIG. 8. Potentials for optimal learning in the white regions
of the phase diagram, showing the appearance of the two
minima that get farther apart with increasing $\alpha$.